{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Factorization machine implemented in PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
    "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
    "collapsed": true
   },
   "source": [
    "Hi! In this tutotial I want to discuss libFFM algorithm and share my implementation of this algorithm in PyTorch. This tutorial should be considered as an extension to already published tutorial [\"Factorization Machines\" (Russian)](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_russian/tutorials/factorization_machines_sygorbatyuk.ipynb), however, my goal is different: to show that given a paper with mathematical description of a model and PyTorch we can easily implement the model all by ourselves."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The original context for creation of factorization machines was recommender system: for instance, recommend a movie to a customer based on his ratings to other movies. However, I want to take a look on this algorithm from another point view: Factorization Machines can be considered as an extension of linear model that additionally incorporate information about features interactions (in an efficient way). In linear models we just compute a weighted sum of predictors and do not take into account interactions among features e.g. $x_i^{(k)} x_j^{(k)}$ for two features, where $i, j$ - feature indices and $k$ is an index of object in the trainset. However, the number of pairwise interactions scales quadratically with the number of features, so for 1000 features we get 1000000 interactions. Needless to say, it is computationally inefficient to compute weights for each interactions, moreover a model with a large number of parameters is prone to overfitting. Factorization Machines use an elegant trick: find vectors for each feature and compute interaction weight as a dot product of two those features i.e. we **factorize** interactions weight matrix $W \\in \\mathbb{R}^{n \\times n}$ as a product $VV^{^T}$, where $V \\in \\mathbb{R}^{n \\times k}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The FM model definition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The factorization machine with pairwise interactions is defines as:\n",
    "$$\\hat{y}(x) = w_0 + \\sum_{i=1}^{n}w_i x_i + \\sum_{i=1}^{n} \\sum_{j=i+1}^{n} \\langle \\textbf v_i, \\textbf v_j \\rangle x_i x_j$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first two terms is just a linear model (or, in the Deep Learning lingo, a linear layer). The last term can be expressed as:\n",
    "$$\n",
    "\\sum_{i=1}^{n} \\sum_{j=i+1}^{n} \\langle \\textbf v_i, \\textbf v_j \\rangle x_i x_j =\n",
    "\\frac{1}{2} \\sum_{f=1}^{k} \\Big( \\big(\\sum_{i=1}^{n} v_f^{(i)} x_i \\big)^2 - \\sum_{i=1}^{n}v_f^{(i) 2} x_i^2 \\Big) = \n",
    "\\frac{1}{2} \\sum_{f=1}^{} \\Big( S_{1,f}^2 - S_{2,f} \\Big) =\n",
    "\\frac{1}{2} \\Big( S_{1}^2 - S_{2} \\Big),\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where I used $S_1$ and $S_2$ for clarity. [](http://)Suppose we have $M$ training objects, $n$ features and we want to factorize feature interaction with vectors of size $k$ i.e. dimensionality of $v_i$. Let us denote our trainset as $X \\in \\mathbb{R}^{M \\times n}$ , and matrix of $v_i$ (the ith row is $v_i$) as  $V \\in \\mathbb{R}^{n \\times k}$. Also let's denote feature vector for the jth object as $\\textbf x_j$. So:<br><br>\n",
    "$$\n",
    "X = \\begin{bmatrix}\n",
    "x_1^{(1)} & \\dots & x_n^{(1)}\\\\\n",
    " \\vdots \\ & \\ddots \\ & \\vdots \\\\ \n",
    "x_1^{(M)} & \\dots & x_n^{(M)} \\\\\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "<br><br>\n",
    "$$\n",
    "V = \\begin{bmatrix}\n",
    "v_1^{(1)} & \\dots & v_k^{(1)}\\\\\n",
    " \\vdots \\ & \\ddots \\ & \\vdots \\\\ \n",
    "v_1^{(n)} & \\dots & v_k^{(n)} \\\\\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "<br>\n",
    "The number in brackets indicate the index of the sample for $x$ and the index of feature for $v$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us rewrite the formula above in matrix form. Our final result should be the matrix of size $M \\times 1$. We clearly see $S_1 = \\sum_{i=1}^{n} v_f^{(i)} x_i $ is a dot product of feature vector  $\\textbf x_j$ (a row of $X$) and the ith column of $V$. If we multiply $X$ and $V$, we get: <br><br>\n",
    "$$\n",
    "XV = \\begin{bmatrix}\n",
    "\\sum_{i=1}^{n} v_f^{(1)} x_i^{(1)}  & \\dots &  \\sum_{i=1}^{n} v_f^{(k)} x_i\\\\\n",
    " \\vdots \\ & \\ddots \\ & \\vdots \\\\ \n",
    "\\sum_{i=1}^{n} v_f^{(1)} x_i^{(M)} & \\dots & \\sum_{i=1}^{n} v_f^{(k)} x_i^{(M)} \\\\\n",
    "\\end{bmatrix} = \n",
    "\\begin{bmatrix}\n",
    "S_{1,1}^{(1)}  & \\dots &  S_{1,k}^{(1)}\\\\\n",
    " \\vdots \\ & \\ddots \\ & \\vdots \\\\ \n",
    "S_{1,1}^{(M)}  & \\dots & S_{1,k}^{(M)} \\\\\n",
    "\\end{bmatrix}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hm, looks pretty good. So if square $XV$ element-wise and then find sum of each row, we obtain vector of $S_1^2$ terms for each training sample. Also, if we first square $X$ and $V$ element-wise, then multiply them and finally sum by rows,  we'll get $S_2$ term for each training object. So, conceptually, we can express the final term like this:\n",
    "$$\n",
    "\\hat{\\textbf{y}}(X) = \\frac{1}{2} ( square(XV) - (square(X) \\times square(V) )).sum(rowwise),\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's translate it into PyTorch model!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PyTorch model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import random\n",
    "\n",
    "from sklearn.model_selection import KFold\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "\n",
    "import os\n",
    "import copy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "DEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will test our model on data from recently ended (not really ended, but you know what I mean) [mlcourse.ai: Dota 2 Winner Prediction](https://www.kaggle.com/c/mlcourse-dota2-win-prediction). We won't use all features, only binary indicators of hero_ids for each team. We will try to find if there's any \"synergy\" among pairs of heroes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have to do two things: add a linear layer and define all matrix operations from the expression above. Addition of a linear layer in straightforward. As to factorization part, we shouldn't forget to register $V$ as a parameter of our model with `nn.Parameter` (otherwise, we won't be able to learn optimal $V$ with gradient descent). What is good about PyTorch that we don't have to bother with finding a derivative of our expression, PyTorch will do that for us!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TorchFM(nn.Module):\n",
    "    def __init__(self, n=None, k=None):\n",
    "        super().__init__()\n",
    "        # Initially we fill V with random values sampled from Gaussian distribution\n",
    "        # NB: use nn.Parameter to compute gradients\n",
    "        self.V = nn.Parameter(torch.randn(n, k),requires_grad=True)\n",
    "        self.lin = nn.Linear(n, 1)\n",
    "\n",
    "        \n",
    "    def forward(self, x):\n",
    "        out_1 = torch.matmul(x, self.V).pow(2).sum(1, keepdim=True) #S_1^2\n",
    "        out_2 = torch.matmul(x.pow(2), self.V.pow(2)).sum(1, keepdim=True) # S_2\n",
    "        \n",
    "        out_inter = 0.5*(out_1 - out_2)\n",
    "        out_lin = self.lin(x)\n",
    "        out = out_inter + out_lin\n",
    "        \n",
    "        return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You see it's not that much of code. Let's try our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load train data\n",
    "train_df = pd.read_csv('data/dota_train_binary_heroes.csv', index_col='match_id_hash')\n",
    "test_df = pd.read_csv('data/dota_train_binary_heroes.csv', index_col='match_id_hash')\n",
    "target = pd.read_csv('data/train_targets.csv', index_col='match_id_hash')\n",
    "y = target['radiant_win'].values.astype(np.float32)\n",
    "y = y.reshape(-1,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert to 32-bit numbers to send to GPU \n",
    "X_train = train_df.values.astype(np.float32)\n",
    "X_test = test_df.values.astype(np.float32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to train our model we have to define several functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To compute probalities\n",
    "def sigmoid(x):\n",
    "    return 1 / (1 + np.exp(-x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for reproducibility\n",
    "def seed_everything(seed=1234):\n",
    "    random.seed(seed)\n",
    "    os.environ['PYTHONHASHSEED'] = str(seed)\n",
    "    np.random.seed(seed)\n",
    "    torch.manual_seed(seed)\n",
    "    torch.cuda.manual_seed(seed)\n",
    "    torch.backends.cudnn.deterministic = True\n",
    "seed_everything()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# main training function\n",
    "def train_mlp(X, X_test, y, folds, model_class=None, model_params=None, batch_size=128, epochs=1,\n",
    "              criterion=None, optimizer_class=None, opt_params=None,\n",
    "#               clr=cyclical_lr(10000),\n",
    "              device=None):\n",
    "    \n",
    "    seed_everything()\n",
    "    models = []\n",
    "    scores = []\n",
    "    train_preds = np.zeros(y.shape)\n",
    "    test_preds = np.zeros((X_test.shape[0], 1))\n",
    "    \n",
    "    X_tensor, X_test, y_tensor = torch.from_numpy(X).to(device), torch.from_numpy(X_test).to(device), torch.from_numpy(y).to(device)\n",
    "    for n_fold, (train_ind, valid_ind) in enumerate(folds.split(X, y)):\n",
    "        \n",
    "        print(f'fold {n_fold+1}')\n",
    "        \n",
    "        train_set = TensorDataset(X_tensor[train_ind], y_tensor[train_ind])\n",
    "        valid_set = TensorDataset(X_tensor[valid_ind], y_tensor[valid_ind])\n",
    "        \n",
    "        loaders = {'train': DataLoader(train_set, batch_size=batch_size, shuffle=True),\n",
    "                   'valid': DataLoader(valid_set, batch_size=batch_size, shuffle=False)}\n",
    "        \n",
    "        model = model_class(**model_params)\n",
    "        model.to(device)\n",
    "        best_model_wts = copy.deepcopy(model.state_dict())\n",
    "        \n",
    "        optimizer = optimizer_class(model.parameters(), **opt_params)\n",
    "#         scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, [clr])\n",
    "        \n",
    "        # training cycle\n",
    "        best_score = 0.\n",
    "        for epoch in range(epochs):\n",
    "            losses = {'train': 0., 'valid': 0}\n",
    "            \n",
    "            for phase in ['train', 'valid']:\n",
    "               \n",
    "                if phase == 'train':\n",
    "                    model.train()\n",
    "                else:\n",
    "                    model.eval()\n",
    "                \n",
    "                for batch_x, batch_y in loaders[phase]:\n",
    "                    optimizer.zero_grad()\n",
    "                    out = model(batch_x)\n",
    "                    loss = criterion(out, batch_y)\n",
    "                    losses[phase] += loss.item()*batch_x.size(0)\n",
    "                    \n",
    "                    with torch.set_grad_enabled(phase == 'train'):\n",
    "                        if phase == 'train':\n",
    "                            loss.backward()\n",
    "#                             scheduler.step()\n",
    "                            optimizer.step()\n",
    "\n",
    "                losses[phase] /= len(loaders[phase].dataset)\n",
    "            \n",
    "            # after each epoch check if we improved roc auc and if yes - save model\n",
    "            with torch.no_grad():\n",
    "                model.eval()\n",
    "                valid_preds = sigmoid(model(X_tensor[valid_ind]).cpu().numpy())\n",
    "                epoch_score = roc_auc_score(y[valid_ind], valid_preds)\n",
    "                if epoch_score > best_score:\n",
    "                    best_model_wts = copy.deepcopy(model.state_dict())\n",
    "                    best_score = epoch_score\n",
    "            \n",
    "            if ((epoch+1) % 30) == 0:\n",
    "                print(f'epoch {epoch+1} train loss: {losses[\"train\"]:.3f} valid loss {losses[\"valid\"]:.3f} valid roc auc {epoch_score:.3f}')\n",
    "        \n",
    "        # prediction on valid set\n",
    "        with torch.no_grad():\n",
    "            model.load_state_dict(best_model_wts)\n",
    "            model.eval()\n",
    "            \n",
    "            train_preds[valid_ind] = sigmoid(model(X_tensor[valid_ind]).cpu().numpy())\n",
    "            fold_score = roc_auc_score(y[valid_ind], train_preds[valid_ind])\n",
    "            scores.append(fold_score)\n",
    "            print(f'Best ROC AUC score {fold_score}')\n",
    "            models.append(model)\n",
    "\n",
    "            test_preds += sigmoid(model(X_test).cpu().numpy())\n",
    "    \n",
    "    print('CV AUC ROC', np.mean(scores), np.std(scores))\n",
    "    \n",
    "    test_preds /= folds.n_splits\n",
    "    \n",
    "    return models, train_preds, test_preds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "folds = KFold(n_splits=5, random_state=17)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our kernel is just a proof of concept, we won't optimize hyperparameters and set high learning rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fold 1\n",
      "epoch 30 train loss: 1.183 valid loss 1.244 valid roc auc 0.521\n",
      "epoch 60 train loss: 0.790 valid loss 0.848 valid roc auc 0.545\n",
      "epoch 90 train loss: 0.710 valid loss 0.763 valid roc auc 0.560\n",
      "epoch 120 train loss: 0.683 valid loss 0.734 valid roc auc 0.568\n",
      "epoch 150 train loss: 0.670 valid loss 0.721 valid roc auc 0.572\n",
      "epoch 180 train loss: 0.663 valid loss 0.714 valid roc auc 0.575\n",
      "epoch 210 train loss: 0.658 valid loss 0.710 valid roc auc 0.577\n",
      "epoch 240 train loss: 0.655 valid loss 0.708 valid roc auc 0.578\n",
      "epoch 270 train loss: 0.652 valid loss 0.707 valid roc auc 0.578\n",
      "epoch 300 train loss: 0.650 valid loss 0.706 valid roc auc 0.578\n",
      "Best ROC AUC score 0.5784996096089556\n",
      "fold 2\n",
      "epoch 30 train loss: 1.179 valid loss 1.220 valid roc auc 0.528\n",
      "epoch 60 train loss: 0.785 valid loss 0.830 valid roc auc 0.553\n",
      "epoch 90 train loss: 0.708 valid loss 0.750 valid roc auc 0.567\n",
      "epoch 120 train loss: 0.683 valid loss 0.722 valid roc auc 0.575\n",
      "epoch 150 train loss: 0.671 valid loss 0.710 valid roc auc 0.580\n",
      "epoch 180 train loss: 0.664 valid loss 0.703 valid roc auc 0.584\n",
      "epoch 210 train loss: 0.660 valid loss 0.700 valid roc auc 0.586\n",
      "epoch 240 train loss: 0.657 valid loss 0.698 valid roc auc 0.588\n",
      "epoch 270 train loss: 0.655 valid loss 0.696 valid roc auc 0.589\n",
      "epoch 300 train loss: 0.653 valid loss 0.695 valid roc auc 0.590\n",
      "Best ROC AUC score 0.5901286682631814\n",
      "fold 3\n",
      "epoch 30 train loss: 1.143 valid loss 1.230 valid roc auc 0.509\n",
      "epoch 60 train loss: 0.773 valid loss 0.841 valid roc auc 0.536\n",
      "epoch 90 train loss: 0.702 valid loss 0.759 valid roc auc 0.553\n",
      "epoch 120 train loss: 0.679 valid loss 0.730 valid roc auc 0.563\n",
      "epoch 150 train loss: 0.668 valid loss 0.717 valid roc auc 0.569\n",
      "epoch 180 train loss: 0.662 valid loss 0.710 valid roc auc 0.573\n",
      "epoch 210 train loss: 0.658 valid loss 0.706 valid roc auc 0.575\n",
      "epoch 240 train loss: 0.656 valid loss 0.703 valid roc auc 0.576\n",
      "epoch 270 train loss: 0.654 valid loss 0.702 valid roc auc 0.577\n",
      "epoch 300 train loss: 0.652 valid loss 0.701 valid roc auc 0.577\n",
      "Best ROC AUC score 0.577274599275615\n",
      "fold 4\n",
      "epoch 30 train loss: 1.163 valid loss 1.220 valid roc auc 0.521\n",
      "epoch 60 train loss: 0.780 valid loss 0.829 valid roc auc 0.543\n",
      "epoch 90 train loss: 0.706 valid loss 0.749 valid roc auc 0.559\n",
      "epoch 120 train loss: 0.682 valid loss 0.721 valid roc auc 0.570\n",
      "epoch 150 train loss: 0.670 valid loss 0.709 valid roc auc 0.576\n",
      "epoch 180 train loss: 0.664 valid loss 0.703 valid roc auc 0.580\n",
      "epoch 210 train loss: 0.660 valid loss 0.699 valid roc auc 0.582\n",
      "epoch 240 train loss: 0.657 valid loss 0.697 valid roc auc 0.584\n",
      "epoch 270 train loss: 0.655 valid loss 0.695 valid roc auc 0.584\n",
      "epoch 300 train loss: 0.653 valid loss 0.695 valid roc auc 0.585\n",
      "Best ROC AUC score 0.584847163764615\n",
      "fold 5\n",
      "epoch 30 train loss: 1.170 valid loss 1.238 valid roc auc 0.521\n",
      "epoch 60 train loss: 0.781 valid loss 0.842 valid roc auc 0.543\n",
      "epoch 90 train loss: 0.705 valid loss 0.760 valid roc auc 0.556\n",
      "epoch 120 train loss: 0.679 valid loss 0.732 valid roc auc 0.564\n",
      "epoch 150 train loss: 0.668 valid loss 0.719 valid roc auc 0.568\n",
      "epoch 180 train loss: 0.661 valid loss 0.713 valid roc auc 0.570\n",
      "epoch 210 train loss: 0.657 valid loss 0.710 valid roc auc 0.571\n",
      "epoch 240 train loss: 0.654 valid loss 0.708 valid roc auc 0.572\n",
      "epoch 270 train loss: 0.652 valid loss 0.707 valid roc auc 0.573\n",
      "epoch 300 train loss: 0.651 valid loss 0.706 valid roc auc 0.573\n",
      "Best ROC AUC score 0.5726955539648453\n",
      "CV AUC ROC 0.5806891189754424 0.006111373083841393\n",
      "CPU times: user 10min 32s, sys: 7.45 s, total: 10min 39s\n",
      "Wall time: 10min 40s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "MS, train_preds, test_preds = train_mlp(X_train, X_test, y, folds, \n",
    "                            model_class=TorchFM, \n",
    "                            model_params={'n': X_train.shape[1], 'k': 5}, \n",
    "                            batch_size=1024,\n",
    "                            epochs=300,\n",
    "                            criterion=nn.BCEWithLogitsLoss(),\n",
    "                            optimizer_class=torch.optim.SGD, \n",
    "                            opt_params={'lr': 0.01, 'momentum': 0.9},\n",
    "                            device=DEVICE\n",
    "                            )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I certainly wouldn't call it a good model, but at least it works. We see that there's indeed some link between teams composition and winning in the match."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial we made our own Factorization Machine with a pinch of linear algebra and autograd magic of PyTorch.\n",
    "\n",
    "Good luck with your own experiments!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Original paper by Rendle https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}